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ABSTRACT 

Motivation: Polyadenylation is the addition of a poly(A) tail to an RNA 
molecule. Identifying DNA sequence motifs that signal the addition of 
poly(A) tails is essential to improved genome annotation and better 
understanding of the regulatory mechanisms and stability of mRNA. 
Existing poly(A) motif predictors demonstrate that information ex- 
tracted from the surrounding nucleotide sequences of candidate 
poly(A) motifs can differentiate true motifs from the false ones to a 
great extent. A variety of sophisticated features has been explored, 
including sequential, structural, statistical, thermodynamic and evolu- 
tionary properties. However, most of these methods involve extensive 
manual feature engineering, which can be time-consuming and can 
require in-depth domain knowledge. 

Results: We propose a novel machine-learning method for poly(A) 
motif prediction by marrying generative learning (hidden Markov 
models) and discriminative learning (support vector machines). 
Generative learning provides a rich palette on which the uncer- 
tainty and diversity of sequence information can be handled, while 
discriminative learning allows the performance of the classification 
task to be directly optimized. Here, we used hidden Markov 
models for fitting the DNA sequence dynamics, and developed an 
efficient spectral algorithm for extracting latent variable informa- 
tion from these models. These spectral latent features were then fed 
into support vector machines to fine-tune the classification 
performance. 

We evaluated our proposed method on a comprehensive human 
poly(A) dataset that consists of 14740 samples from 12 of the 
most abundant variants of human poly(A) motifs. Compared with 
one of the previous state-of-the-art methods in the literature 
(the random forest model with expert-crafted features), our method 
reduces the average error rate, false-negative rate and false-positive 
rate by 26, 15 and 35%, respectively. Meanwhile, our method 
makes ~30% fewer error predictions relative to the other string 
kernels. Furthermore, our method can be used to visualize the 
importance of oligomers and positions in predicting poly(A) motifs, 
from which we can observe a number of characteristics in the sur- 
rounding regions of true and false motifs that have not been reported 
before. 
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Supplementary information: Supplementary data are available at 
Bioinformatics online. 
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1 INTRODUCTION 

Roughly speaking, when DNA is transcribed to RNA, a string of 
adenine (A) nucleotides, referred to as the polyadenylation tail or 
the poly(A) tail, is added to the 3 r -end of the primary RNA 
transcript. Such a process is called polyadenylation, which is a 
step to protect RNA stability, nuclear export and translation 
(Bernstein and Ross, 1989; Leung et al, 2011). A poly(A) tail 
is ~10-30nt downstream of a signaling site that consists of 6nt, 
which in human cells most commonly is AAV AAA (Beaudoing 
et al, 2000). This signaling site is known as a poly(A) signal and 
the corresponding 6 nt subsequences in DNA are called poly(A) 
motifs. Mutations in poly(A) signals can be associated with dis- 
eases, e.g. colorectal cancer (Kim et al, 2002; Pastrello et al, 
2006) and compromised immunodeficiency (Das et al., 1997; 
Langemeier et al., 2012). 

The poly(A) signal prediction problem has been studied for 
decades (Proudfoot, 2011). There are two versions of the prob- 
lem: predicting poly(A) signals in mRNA sequences and predict- 
ing poly(A) motifs in DNA sequences. Intuitively, the former 
version is much simpler than the latter one because once the 
mRNA sequence is given, the poly(A) tail can be identified rela- 
tively easily and we need only to search for the poly(A) signal(s) 
in a window of at most 30 nt upstream the poly(A) tail. In DNA, 
however, the presence of introns in eukaryotes and the absence 
of poly(A) tails make the recognition much more challenging. 

A number of studies have demonstrated that information from 
relatively short upstream and downstream sequences of the can- 
didate poly(A) motifs can specify the true poly(A) motifs to a 
great extent (Ahmed et al, 2009; Akhtar et al, 2010; Chang 
et al, 2011; Cheng et al, 2006; Graber et al, 1999; Ji et al, 
2010; Kalkatawi et al, 2012; Legendre and Gautheret, 2003; 
Liu et al, 2005; Salamov and Solovyev, 1997; Tabaska and 
Zhang, 1999). Statistical properties of the surrounding sequences 
were explored in different species, such as yeast (van Helden 
et al, 2000), fly (Retelska et al, 2006), Arabidopsis and rice (Ji 
et al, 2010) and human (Chang et al, 201 1; Retelska et al, 2006; 
Tabaska and Zhang, 1999). Although significant progress has 
been made to the accuracy of poly(A) motif predictors, especially 
in human DNA sequences, such methods are all based on using 
sophisticated features that require additional efforts to extract 
and are highly dependent on domain knowledge. We therefore 
ask the following question: can we use machine-learning tech- 
niques to automatically extract useful features from human 
DNA sequences and achieve state-of-the-art poly(A) motif clas- 
sification results? If the answer to this question is yes, then we 
can automate this genome annotation task to a great extent and 
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potentially speed up our understanding of the underlying 
biology. 

Automatic feature extraction techniques have been explored in 
many other sequence classification problems. By far, the most 
successful methods are hidden Markov models (HMMs), e.g. in 
gene finding (Lukashin and Borodovsky, 1998; Stanke and 
Waack, 2003), and string kernels with support vector machines 
(SVM), e.g. in protein classification (Leslie et aL, 2004), tran- 
scription start- site recognition (Sonnenburg et aL, 2006) and 
splice- site prediction (Sonnenburg et aL, 2007). The former 
methods are generative models that capture the uncertainty in 
data using probabilistic languages, while the latter are discrim- 
inative methods that optimize specifically for the classification 
results. The advantages of each class of methods have rarely 
been combined systematically to yield even better feature 
extractors. 

In this article, we propose a novel method for poly(A) motif 
prediction by marrying generative learning (HMMs) and dis- 
criminative learning (SVMs). Generative learning provides us 
with a rich palette for handling the uncertainty and diversity of 
sequence information, while discriminative learning allows us to 
directly optimize performance for the classification task. Here, 
diversity means that for the same position in the surrounding 
sequence of a candidate poly(A) motif, there may be multiple 
subsequences indicating the class label (a true motif or a false 
one), and uncertainty means that each of these subsequences does 
not give deterministic information about the class label. In par- 
ticular, we use HMMs as a probabilistic generative model for 
DNA sequences, and develop an efficient spectral algorithm for 
extracting latent variable information from these models. The 
HMMs model not only the diversity and uncertainty of sequence 
information, but also long-range dependencies between subse- 
quences at different positions, which cannot be simultaneously 
captured by string kernels as is discussed in Section 2.2. The 
spectral latent features are then fed into support vector machines 
to fine-tune the classification performance. 

2 RELATED WORKS 

We will first review the two classes of methods, expert-crafted 
features and string kernels, for addressing the poly(A) motif clas- 
sification problem. 

2.1 Features based on domain knowledge 

Bioinformatics experts can craft highly informative features 
based on prior knowledge of the biology and physics of DNA 
sequences. The drawback of using these features is that they re- 
quire extensive expert domain knowledge for their design as well 
as additional efforts to extract them. 

Salamov and Solovyev developed POLY AH (Salamov and 
Solovyev, 1997), a tool that used a linear discriminant func- 
tion-based classifier to extract features from lOOnt upstream 
and 200 nt downstream of a candidate poly(A) motif. Later on, 
polyadq was developed based on quadratic discriminant func- 
tions by encoding features from 100 nt downstream only 
(Tabaska and Zhang, 1999). Several support vector machine- 
based predictors were then proposed and they performed well 
on recognizing true poly(A) motifs. Such methods include 



DNAFSMiner, which was based on both 100 nt upstream and 
downstream (Liu et aL, 2005), Polya_svm, which encoded cis- 
regulatory element features (Cheng et aL, 2006), and polyApred, 
which extracted features from lOOnt upstream and downstream 
of the candidate poly(A) motifs (Ahmed et aL, 2009). In 2010, 
Akhtar et al. proposed POLYAR (Akhtar et aL, 2010), a linear 
discriminant analysis-based method that used features from 
300 nt upstream and downstream of the candidate poly(A) 
motifs. 

Recently, Kalkatawi et al. developed an artificial neural net- 
work (ANN)-based method and a random forest (RF)-based 
method to predict poly(A) motifs in human DNA sequences 
(Kalkatawi et aL, 2012). Their methods were based on a variety 
of expert-crafted features, such as thermodynamic and structural 
features of dinucleotides, electron-ion interaction potentials and 
position-weight matrices of upstream and downstream regions 
relative to the candidate poly(A) motifs. In total, they extracted 
274 features. They compiled a large-scale benchmark set that 
contained 14 740 sequences for the 12 main variants of human 
poly(A) motifs. The ANN and RF models significantly outper- 
formed all previously reported studies. 

2.2 String kernels 

String kernels are positive definite functions to compute similar- 
ity between two sequences, which is then used in support vector 
machines to learn classifiers. These kernels essentially map se- 
quences into high-dimensional feature spaces corresponding to 
subsequences and then compute inner products between two fea- 
ture vectors. The drawback of string kernels is that they simply 
count raw sequence matches and do not explicitly take into ac- 
count the uncertainty and diversity of sequence information. 
Many string kernels have been designed over the years, but so 
far few have effectively made use of generative models to deal 
with data uncertainty. 

More specifically, given an alphabet S, here the DNA nucleo- 
tides £ = {A, G, C, T}, let x e Y> k be a sequence of length k 
(or fc-mer). The ^-spectrum (SPE) kernel /c(x, y) counts pairs of 
identical £-mers between two sequences x and y (of length L and 
L' , respectively) independently of their position (Leslie et aL, 
2002): k(x, y) = E£? +1 D{*«h*-i = JV.-m-i }, where 

x t:t+ k-i : = x t x t+ \ . . . x t+ k-i denotes a subsequence of x that 
starts at position t and has length k, and D{-} returns 1 if the 
two /r-mers are the same and otherwise 0. This kernel effectively 
maps each sequence into a feature space where each dimension 
counts the number of occurrences of a particular /c-mer and uses 
the inner product as the similarity between two sequences. To 
take uncertainty and diversity in &-mer features into account, one 
can also heuristically include counts on the mismatch of &-mers 
(Leslie et aL, 2002). 

In contrast to the SPE kernel, the weighted degree (WD) 
kernel explicitly takes into account the absolute positions of 
the /r-mers in the sequence (Ratsch and Sonnenburg, 2004): 
K(x,y) = Ya=\ AEf=/ +1 H{**:f+/-i =yt.t+i-i}- Analogously, one 
can also heuristically incorporate the counts for mismatches in 
the WD kernel to account for a certain degree of sequence un- 
certainty and diversity. 

However, heuristic ways of handling mismatches may result in 
underutilization of the sequence information. A principle way to 
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deal with such uncertainty is to model mismatches as random 
variables. Along this direction, the probability product kernel 
(PPK; Jebara et al, 2004) has been proposed for sequence ana- 
lysis. This kernel also compares sequences of equal length and 
assumes that the absolute position in the sequence carries dis- 
criminative information. The key idea is to fit a probabilistic 
generative model (e.g. HMMs) to each sequence separately, 
and then use an inner product between these generative models 
to define the kernel. As a result, the PPK allows us to combine 
discriminative learning of support vector machines with genera- 
tive modeling of data. 

The seminal work on PPKs has several limitations. First, a 
different HMM is fitted to each sequence, which can lead to 
poorly estimated model parameters and bury the useful signals. 
Second, training HMMs with a traditional expectation maxi- 
mization (EM) algorithm (Dempster et al, 1977) can be time- 
consuming. Third, while the hidden variables model the 'clean' 
signals, they are summed out and not directly used for sequence 
comparison. Fourth, the label information for the training se- 
quences is not used for defining the kernel. Owing to these limi- 
tations, the PPK does not perform as well as the SPE kernel or 
the WD kernel as we show in later experiments. In the following, 
we first describe our method, which also combines generative 
and discriminative learning, but overcomes the above four 
limitations. 

3 METHODS 

The key contribution of our article is a new way of extracting and using 
sequence features for classification problems. Our method fits only two 
HMMs for the entire training set, one for sequences in the positive class 
and another one for those in the negative class. Then, we use the posterior 
distributions of the hidden states from each sequence as our features. 
When learning the parameters of the HMMs, we use an efficient spectral 
algorithm recently proposed in machine learning (Hsu et al, 2012). The 
novel combination of the extracted spectral latent features and support 
vector machines leads to state-of-the-art results in poly(A) motif 
prediction. 

3.1 Sequence latent features 

We use HMMs to take into account the uncertainty and diversity of the 
sequence information and hypothesize that there is a 'clean' poly(A) 
signal hidden in the observed sequences. The fact that each hidden vari- 
able is related to the previously observed positions allows us to accom- 
modate long-range dependencies. We use capital letters to denote random 
variables and lower case letters for their instantiations. 

We combine multiple adjacent nucleotides in a DNA sequence into a 
'mega-observation'. For example, we treat the /c-mer A AT as a single 
observation. Thus, each mega-observation has n = 4 k possible states. 
Essentially, we transform a DNA sequence into a sequence of mega-ob- 
servations using an overlapping sliding window of size k, and then asso- 
ciate each mega-observation with a variable X t . For example, a DNA 
sequence of length JJ is transformed to a sequence of L = L! — k + 1 
mega-observations, with each mega-observation being a &-mer. We esti- 
mate the HMM for these transformed sequences. 

Specifically, an HMM contains a Markov chain of hidden variables 
Qi.l '= Q\ • • • Qt • • • Ql that generates the observed sequence of variables 
X\- L =X\ . . . X t . . . X L . Let m and n denote the number of states for the 
hidden and observed variables, respectively. We can fully specify an 
HMM by an m x m transition probability matrix of the hidden variables, 
with the (i, y')-th entry Ty = Pv(Q t+ \ = i\Q t = j), an n x m emission 



probability matrix Oy = Pv(X t = i\Q t = j) and an m dimension prior dis- 
tribution vector over the hidden states Ttf = Pr((?i = 0- With these model 
parameters, we can compute the joint distribution of the hidden and 
observed variables as Pr(X l:L , Q l:L ) = Pr(0O Y\ti MQt+i \Qt) Uti 
Pr(X t \Q t ) and the distribution of the observed variables as 

MXl:L) = E qi:L MXl:L,Ql:L). 

We define the latent feature at position / of the sequence as the pos- 
terior distribution of the hidden variable Q t , given the sequence up to 
position 

= MQt = i\xi :t ) (1) 

This is also called the forward belief of the hidden variable Q t , which 
captures the uncertainty about the 'clean' signal given the observed se- 
quence up to position t (We can easily extend the approach to use the 
posterior distribution of Q t given the entire sequence x as features by 
involving both a forward and a backward recursion.) Note that f t is an m- 
dimensional vector that sums to 1. The latent features for the entire se- 
quence according to the HMM are thus the concatenation of features at 
all positions, i.e. f = (f^, . . . , f^) T . Calculating f t requires us to margin- 
alize over all previous hidden variables Q\, ... , Q t -i, which can be car- 
ried out in a recursive fashion: 

¥r(q t \xi :t ) oc V Fr(x t \q t )Fr(q t \q t -i)Fr(q t -i\xij-i) 

q, ~ x (2) 

= Ylq^ °Xt,q t T qt,q t -i Pr fc-1 \X\:t-\) 

Reexpressing the above relation in matrix form gives 

f,ocTdiag(0 r ,,.)f,_i (3) 

where 0 Xu . denotes the x r th row of the emission probability matrix O. 
This matrix multiplication effectively implements the marginalization 
over variable Q t -\. 

Let S Xt := Tdiag(0. Y ^.) and e be a vector of all ones of size m. 
Equation (3) suggests a recursive algorithm that efficiently extracts the 
latent features. 

f 0 :=7r, f t := , W=1,...,L (4) 

which requires 0(Lm 2 ) time, once we have learned the HMM parameters, 
i.e. the set of quantities {n, T, O}. We note that we learn two HMMs 
{7T + ,T + ,0 + } and {7T _ ,T _ ,0 - } from training sequences, one for those 
with positive labels and the other for those with negative labels. Then for 
a new test sequence, we can extract two latent feature vectors, f*" and f~, 
and use the concatenation of these two features f := (i^ T ,f~ T ) T as the 
sequence features. 

One straightforward but computationally intensive way to learn the 
HMM is by using the EM. However, EM has the problem of slow con- 
vergence and convergence only to a local minimum. These two problems 
will affect the efficiency and efficacy of these latent feature extractions. To 
overcome the shortcomings of the EM algorithm, we use a fast and local- 
minimum-free spectral algorithm to learn an alternative parameterization 
of HMMs, which is described next. 

3.2 Efficient spectral algorithm for latent features 

Traditional HMM learning algorithms try to recover the parameters tt, T 
and O. The resulting maximum-likelihood estimation problem is not 
convex and algorithms can only find a local optimum. These parameters 
7i, T and O characterize the relations between hidden and observed vari- 
ables that cannot be directly observed during training and are usually not 
uniquely identifiable. However, we may not need to recover them exactly 
to extract latent features. Instead, it is sufficient to recover them up to 
some invertible transformation if we subsequently learn a linear classifier 
such as a support vector machine. More specifically, suppose matrix A of 
size m x m is invertible. Define the transformed HMM parameters 
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h 0 := Att, hoo := (A -1 ) e, H x := AS x A~ l (5) 
Then we can compute the transformed latent feature as 



since the invertible matrix A cancels with A -1 during matrix multiplica- 
tion. Then, the latent features of the final sequence that we use in our 
experiments are h := (h +T ,h _T ) T by concatenating the transformed fea- 
tures from positive and negative HMMs. It is easy to see that the trans- 
formed feature will achieve the same performance as the original one with 
a linear classifier because the transformation matrix can be incorporated 
into the classifier weight vector. That is, if we learn a binary classifier 
sign(w T f+&), then sign(w T h + b) with w = (A _1 ) T w will achieve the 
same classification accuracy. 

A natural question is how to choose an A such that the transformed 
parameters {ho,hoo,H x } can be easier to estimate without using an EM 
algorithm. To solve this problem, we use a construction of A by Hsu et al. 
(2012), which allows these transformed parameters to be estimated from 
just tri-gram information of the observed sequences. Formally, let 
{X\,X 2 ,X{) be a triple of adjacent variables in the HMM and define 
the marginal probabilities of observation singletons, pairs and triples as 

[ci] f = PrC*! = 0 (7) 
[C 2 , l ] ij = Pr(X 2 = i,X l =j) (8) 

[C 3 ,,, ilj = Pr(JT 3 = i, X 2 =x,Xi=J), \<x<n (9) 

ci is an n dimensional vector, C2, 1 is an n x n matrix and C3 jJCj 1 is a series 
of n x n matrices indexed by x. Hsu et al. (2012) showed that setting 
A = U t O directly links the above three quantities to the transformed 
parameters in (5), where U is the leading m principal left singular vectors 
of C2, 1. Specifically, the transformed parameters can be directly re- 
covered from single sequence statics as 

ho=U T ci, h 00 =(Cj 1 U)t Cl (10) 

H x =U T C 3!X ,i(U T C 2 ,i)t (11) 
where (•)"!" computes the pseudo-inverse of a matrix. 



Algorithm 1 Spectral learning of transformed HMM parameters 

Require: m — the number of hidden states, k — the number of nt to com- 
bine. Ensure: transformed HMM parameters {ho, hoo, H*}. 

1: Transform all training sequences by combining k consecutive nt into 
a 'mega-observation'. 

2: Use all triples (xi,x 2 ,xi) from the transformed sequences to esti- 
mate ci, C2, 1 and C3 ;Xj i according to (12), (13) and (14). 

3: Compute the Singular Value Composition (SVD) of C 2 , 1, and let U 
be the matrix of left singular vectors corresponding to the m largest 
singular values. 

4: Compute transformed model parameters using (10) and (11). 



The learning algorithm for HMMs is summarized in Algorithm 1. It 
first estimates the quantities in (7), (8) and (9) using training data 

UL ^ d=\ t=\ ieZ k 

[C2 < l] * * M^nEE E = ft/)} (13) 



Ji2)EE E H4+2 = (uj)} (14) 

L) d=l t=l i,j,xe& 

where D is the number of training sequences. These estimates are 
subsequently used to compute the transformed HMM model parameters 
according to (10) and (11). The algorithm has two parameters m and 
k that can be tuned by cross-validation. The major computation is 
an SVD of C2, 1 and hence the name 'spectral algorithm'. We note that 
we learn two HMMs, one for the positive class and the other for the 
negative class. Then, we use both HMMs to extract features for each test 
sequence and concatenate the features, which are summarized in 
Algorithm 2. 



Algorithm 2 Spectral latent feature extraction algorithm 

Require: (x\ , . . . , x t ) — a test sequence, k — the number of nt to combine, 
{Iiq ,h+ ,H+} and {hp ,h~ ,H~} — learned HMM models for positive and 
negative classes. Ensure: spectral latent features h. 

1: Transform the input sequence by combining k adjacent nt into a 
'mega- observation'. 

2: For t=l...L, compute h+ = H+h+ 1 /(h+ T H+h+ X 

3: For t = 1 . . . L, compute h~ = H^h^/fh^H^hJLjY 

4: Concatenate features h = (h| T , . . . , h+ T , h^ T , . . . , h~ T ) T . 



3.2.1 Fast implementation The runtime of algorithm 1 is dominated 
by the SVD computation of an n x n matrix C2, 1 , and the memory re- 
quirement is dominated by storing the tri-gram statistics C3 jX; 1 for each x. 
One technical challenge is that n, the number of possible values of 'mega- 
observation', can grow as n = 4 k , exponential in k. It seems, at first sight, 
that we may need to decompose a huge matrix C2, 1 , and the memory 
requirement for C3 jX) i is prohibitively large. However, most entries in 
these matrices are zero because some /r-mers do not exist in the training 
sequences. Moreover, the total number of non-zero entries is at most the 
number of 'mega-observations' in the training set. Taking advantage of 
this property, we can do sparse matrix SVD and store all the tri-gram 
statistics in a sparse matrix, thus facilitating efficient computation and 
manipulation. The computational complexity thus grows linearly with the 
number of 'mega-observations' in the worst case. 

3.3 Visualizing the importance of A>mers and positions 

Besides accurately classifying the sequences, we are also interested in k- 
mers and positions that are most informative for motif classification. 
Sonnenburg et al. (2008) proposed positional oligomer importance matri- 
ces (POIMs) for WD kernels to analyze the importance of substrings in 
different locations of the sequence. Here, we also develop a technique for 
visualizing the importance of the /r-mer at each position t for the classi- 
fication problem based on our spectral latent features. 

Intuitively, we want to use the contribution of the /c-mer at position t 
to the support vector classifier as its importance score. In particular, we 
make use of the margin of a training sequence in the support vector 
machine. For example, if most positive sequences with large positive 
margins all contain fc-mer AAGC at position then this k-mer is import- 
ant for correct classifications. More formally, let the support vector clas- 
sifier learned from our spectral latent features be sign(w T h + b) and let 
the margin corresponding to a sequence be s(x) = w T h(x) + b, where we 
use h(x) to indicate that the features are extracted from sequence x. Then, 
we define the importance of £-mer y\± at position / as 

0L{y\±@i) := Exgs^ ^( x ) Pr ( x fe+£-i = yv.k) (15) 
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That is, given the sequences that have y\± at position /, we compute 
the importance as the weighted sum of the margin of these sequences. In 
practice, we only have a finite number of training sequences, and we 
will use the finite sample average to estimate the importance score. 

That is, ot(y hk @t) - \ T (y!. k @t)\ ExeTfo*®/) where r (m@0 denotes 
the set of training sequences with /c-mer y\± occurring at position /. 

Once we have computed the importance score for every fc-mer at every 
position we can visualize it in a few different ways. One way is to 
visualize scores as a heatmap of /c-mer versus sequence position. For 
longer A;-mers, there are 4 k possible values that cannot be easily visua- 
lized. Instead, we sum the absolute values of all £-mer importance scores 
at each position, as is done in Sonnenburg et al. (2008), and visualize the 
importance score as a function of the position That is, 
«(@0 := Ej> 1:Jk6 E* <*(y\-.k@t). 



4 RESULTS 

4.1 Datasets 

The proposed method was tested on the benchmark set proposed 
in Kalkatawi et al. (2012). The dataset contains 14 740 sequences 
(7370 with true poly(A) motifs — positive samples, and 7370 with 
false poly(A) motifs — negative samples) for the 12 main variants 
of human poly(A) motifs (see Table 1 for these variants and their 
respective sizes). For each variant, the number of positive se- 
quences is equal to the number of negative sequences. 
Furthermore, for each variant, the positive motifs with the sur- 
rounding sequences were extracted from human mRNA and 
mapped back to the human genome, whereas the same numbers 
of negative motifs were randomly selected from human chromo- 
some 21. Each sample is a candidate 6nt poly(A) motif sur- 
rounded by lOOnt upstream and downstream. This represents 
a comprehensive benchmark set for poly(A) motifs in human 
DNA sequences. The goal is to predict which candidate motifs 
are true poly(A) motifs. 



4.2 Experimental settings 

The proposed method was tested on each of the 12 datasets using 
5-fold cross-validation. Each dataset was randomly partitioned 
into five subsets, four of which were used for training and val- 
idation, and the remaining one was used for testing in each fold. 

We then searched for me {2,4, 40} and k e {3, 4, 5, 6, 7} 
using cross-validations (Supplementary Fig. SI). For each par- 
ameter combination and each dataset, two HMMs were learned 
for the positive samples and the negative ones in the training 
data. For each HMM, the parameters, {ho, h^, H A -}, were learned 
by Algorithm 1 . For each training sequence of 206 nt, the spec- 
tral latent feature vectors, (hj , . . . ,hJ 06 _ h+l ) T , were then calcu- 
lated for the positive and negative HMMs and concatenated 
(Algorithm 2). A linear Support Vector Machine (SVM) was 
then trained using these spectral latent features. 

Given a testing sequence of 206 nt, the spectral latent feature 
vectors were extracted by Algorithm 2 using the learned HMMs 
for the same poly(A) motif. The concatenated feature vector was 
then given to the corresponding SVM model to predict whether it 
was a true poly(A) motif or a false one. The grid search for 
different parameters with respect to the training and testing 
errors indicated that the parameter ranges that had highest ac- 
curacy on the training set were k = 4,5,6 and m > 20 
(Supplementary Fig. SI). 

4.3 Comparison to other string kernels 

We first compare the classification performance of the proposed 
method (HMM) with the previous state-of-the-art string kernels, 
namely, the PPK, SPE kernel and WD kernel. The best k for 
these alternative kernels, except PPK, was also searched using 
the cross-validation between three and seven, and we report here 
the best results. In Table 1, all reported errors are the average 
over the 5-fold cross-validation. It can be seen that the WD 



Table 1. Comparison of the error rates of our method (HMM) with PPK, SPE and WD 



Variants Size Error rate (%) False-negative rate (%) False-positive rate (%) 



PPK SPE WD HMM Rel PPK SPE WD HMM Rel PPK SPE WD HMM Rel 



A AT AAA 


5190 




23.08 


23.72 


18.59 


19.45 




21.93 


23.70 


18.54 


15.47 




24.24 


23.74 


18.65 


23.05 


ATT AAA 


2400 


27.13 


20.17 


18.29 


16.21 


19.63 


32.50 


22.83 


21.50 


18.17 


20.44 


21.75 


17.50 


15.08 


14.25 


18.57 


AAGAAA 


1250 


31.28 


14.72 


16.72 


9.36 


36.41 


37.12 


14.08 


19.68 


11.36 


19.32 


25.44 


15.36 


13.76 


7.36 


52.08 


AAAAAG 


1230 


15.04 


13.25 


7.80 


5.45 


58.90 


25.20 


8.94 


8.46 


6.02 


32.73 


4.88 


17.56 


7.15 


4.88 


72.22 


AATACA 


880 


31.48 


18.98 


23.18 


15.34 


19.16 


35.91 


19.55 


30.68 


19.09 


2.33 


27.05 


18.41 


15.68 


11.59 


37.04 


TAT AAA 


780 


29.87 


16.28 


18.46 


11.15 


31.50 


34.36 


22.31 


21.54 


15.64 


29.89 


25.38 


10.26 


15.38 


6.67 


35.00 


ACT AAA 


690 


40.72 


24.35 


30.29 


16.96 


30.36 


43.48 


28.41 


39.42 


20.00 


29.59 


37.97 


20.29 


21.16 


13.91 


31.43 


AGTAAA 


670 


31.19 


20.90 


23.88 


14.33 


31.43 


33.73 


30.75 


25.67 


20.60 


33.01 


28.66 


11.04 


22.09 


8.06 


27.03 


GAT AAA 


460 


25.43 


17.39 


14.13 


9.57 


45.00 


35.22 


21.74 


16.96 


10.43 


52.00 


15.65 


13.04 


11.30 


8.70 


33.33 


A AT AT A 


410 


29.51 


15.85 


18.78 


9.27 


41.54 


31.22 


23.90 


25.85 


14.63 


38.78 


27.80 


7.80 


11.71 


3.90 


50.00 


CAT AAA 


410 


32.68 


18.78 


22.20 


12.68 


32.47 


40.98 


22.93 


27.80 


20.49 


10.64 


24.39 


14.63 


16.59 


4.88 


66.67 


A AT AG A 


370 


24.05 


8.11 


14.86 


5.14 


36.67 


22.16 


6.49 


9.73 


6.49 


0.00 


25.95 


9.73 


20.00 


3.78 


61.11 


Average 






19.56 


20.22 


14.42 


28.09 




20.60 


22.47 


16.26 


20.75 




18.52 


17.96 


12.59 


34.17 



Average' denotes the weighted average of the corresponding column. 'Size' denotes the number of samples for the corresponding motif variant. 'Error rate' is the proportion of 
false results in the dataset, which equals one minus accuracy. 'False-negative rate' is the proportion of true poly(A) motifs that are predicted to be false, which equals one 
minus sensitivity. 'False-positive rate' is the proportion of false poly(A) motifs that are predicted to be true, which equals one minus specificity. 'Rel' denotes the relative 
improvement of HMM with respect to SPE. The lowest error rate for each motif variant is indicated in bold. PPK could not finish running within 48 h on AATAAA. 
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kernel compares favorably with the SPE kernel, with slightly 
higher false-negative rate and slightly lower false-positive rate. 
Our method, which simultaneously takes into account location 
information, sequence uncertainty and training labels, performs 
consistently and significantly better than PPK, SPE and WD. 
The PPK kernel has the worst results. As discussed in Section 
2, PPK can suffer from severe overfitting by fitting each sequence 
to a separate HMM and it discards important discriminative 
information by not using the training labels. 

Next, we compare the runtime of different methods using 
the largest two variants, A AT AAA and ATT AAA. Our 
method is significantly faster than alternatives at training 
time, while being comparable in speed at test time (Table 2). 
In this experiment, the training time is equal to the time for 
kernel (or feature) computation for training data plus that for 
learning SVM models; and the test time is equal to the time 
for kernel (or feature) computation for test data plus that for 
classification. At training time, PPK, SPE and WD need to 
compute a square kernel matrix of size D x D, and the asso- 
ciated SVM models need to be trained in the dual form. In 
contrast, our method computes a feature matrix of size 
D x 2mL, and the associated SVM model can be trained in 
the primal form (usually faster than in the dual form). At test 
time, PPK, SPE and WD need to compute the kernel values 
between each support vector (the number of support vectors 



can be large) and each test data point. In contrast, our 
method only computes a feature vector of length 2mL for 
each data point, and then performs an inner product with 
the learned SVM model w, a vector of length 2mL. 

4.4 Comparison to state-of-the-art method: the RF model 

Table 3 compares the proposed method with a state-of-the-art by 
5-fold cross-validation on the 12 variants of human poly(A) 
motifs: the RF model using domain- specific features by 
Kalkatawi et al. (2012). As shown in Table 3, our method is 
always significantly more accurate than RF (much lower error 
rates) and is more sensitive than RF on 1 1 out of the 12 variants. 
In fact, our method improves the error rates by 7-72% as com- 
pared with RF on the 12 motif variants. On average, our method 
has an improvement over RF in terms of the error rate, false- 
positive rate and false-negative rate by ~26, 15 and 35%, 
respectively. These percentages imply that the significant im- 
provement on the error rate is not just the result of a better 
trade off between sensitivity and specificity, but it is the result 
of being a better method in both senses. By comparing the results 
in Tables 1 and 3, it can be seen that the RF model outperforms 
other string kernels (PPK, SPE and WD) in terms of accuracy 
for poly(A) motif prediction. Our method that systematically 
extracts spectral latent features significantly improves upon RF 



Table 2. Runtime comparisons on two variants A ATA A A and ATT AAA for one train/test split, with k = 3 and all other parameters set to optimal 

Time (s) A AT AAA ATT AAA 

PPK SPE WD HMM PPK SPE WD HMM 

Training 46.16 37.38 7.59 2722.81 9.46 6.47 3.86 

Testing 6.81 0.94 1.43 674.08 1.54 0.69 0.67 

Note: PPK could not finish running within 48 h on AATAAA. The values in bold indicate better results. 
Table 3. Comparison of our method (HMM) with RF 

Variants Size Error rate (%) False-negative rate (%) False-positive rate (%) 

RF HMM Rel RF HMM Rel RF HMM Rel 



AATAAA 


5190 


20.06 


18.59 


7.31 


19.74 


18.54 


6.10 


20.37 


18.65 


8.44 


ATT AAA 


2400 


18.42 


16.21 


12.01 


18.68 


18.17 


2.75 


18.15 


14.25 


21.49 


AAAAAG 


1250 


16.64 


9.36 


43.75 


16.53 


11.36 


31.28 


16.75 


7.36 


56.06 


AAGAAA 


1230 


11.06 


5.45 


50.75 


11.92 


6.02 


49.53 


10.15 


4.88 


51.94 


TAT AAA 


880 


19.55 


15.34 


21.53 


18.10 


19.09 


-5.47 


20.87 


11.59 


44.46 


A AT AC A 


780 


19.36 


11.15 


42.39 


18.13 


15.64 


13.73 


20.49 


6.67 


67.46 


AGTAAA 


690 


27.83 


16.96 


39.07 


25.24 


20.00 


20.76 


29.92 


13.91 


53.50 


ACT AAA 


670 


22.09 


14.33 


35.14 


20.69 


20.60 


0.45 


23.36 


8.06 


65.50 


GAT AAA 


460 


20.00 


9.57 


52.17 


21.01 


10.43 


50.33 


18.92 


8.70 


54.04 


CAT AAA 


410 


18.54 


9.27 


50.01 


16.92 


14.63 


13.51 


20.00 


3.90 


80.49 


A AT AT A 


410 


24.88 


12.68 


49.02 


24.12 


20.49 


15.06 


25.59 


4.88 


80.94 


AATAGA 


370 


18.38 


5.14 


72.06 


19.37 


6.49 


66.51 


17.32 


3.78 


78.15 


Average 




19.19 


14.42 


25.62 


18.83 


16.26 


14.81 


19.48 


12.59 


35.40 



Note: The performance of both RF and HMM is evaluated on the same 5-fold cross-validation. 'Rel' denotes the relative improvement of HMM with respect to RF. The 
lowest value for each criterion of each motif variant is indicated in bold. 
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Fig. 1. Visualization of the importance of different dimers at different positions for the 12 variants of human poly(A) motifs. The x-axis gives the 
positions in the sequence. The y-axis lists all 16 possible dimers. The colors denote the levels of importance: the light green color for the positions 0-6 is 
the background color, which indicates that no effects differentiate true and false motifs; the darker the red, the more important the dimer at that position 
is to identifying true motifs; the darker the blue, the more important the dimer at that position is to identifying false motifs. 



and other string kernels on the same task. This supports our 
assumption that uncertainty and diversity information is import- 
ant for this problem and there is a 'clean' DNA signal hidden in 
the observed sequences. 



4.5 Visualizing importance scores of dimers and positions 

Another advantage of our method over previous state-of-the-art 
poly(A) motif predictors is that our method can be used to visu- 
alize the importance of /c-mers or positions to the prediction task. 
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Fig. 2. Visualization of the importance of different positions for the 12 motif variants. The x-axis gives the position in the sequence. For each k from 1 to 
5, the y-axis is the importance score of a position by summing over the absolute values of the importance for all possible /c-mers at that position 



This can provide researchers or users a direct and intuitive way 
to study the patterns and characteristics of DNA sequences. 
More importantly, most previous studies on poly(A) motifs try 
to reveal statistics of the surrounding regions of true motifs. Our 
method, in contrast, can reveal patterns for false motifs at the 
same time. 

In Figure 1, we present the importance scores a(dimer@t) for 
dimers (subsequences of 2 nt, e.g. AG) in determining whether or 
not a candidate poly(A) motif is a true motif. A number of 
interesting observations can be made about this figure: 

AA is an informative subsequence to differentiate true poly(A) 
motifs from false ones in all 12 variants of human poly(A) 
motifs. When AA appears frequently within 30 nt downstream 
of the candidate poly(A) motif, this strongly suggests that it is a 



true poly(A) motif. This is expected because the mRNA cleavage 
site is often 15-25nt downstream of the poly(A) motif, and the 
region between the cleavage site and the motif is known to be A- 
rich (Retelska et aL, 2006). Interestingly, when A A appears fre- 
quently within 100 nt upstream of the candidate motif, the can- 
didate is likely to be a false motif. 

CG is an interesting dimer. Its positions carry important in- 
formation for determining both true and false poly(A) motifs. 
When CG appears beyond 40 nt upstream of the motif, this sug- 
gests that the motif is a true poly(A) motif. On the other hand, 
CG serves as a strong sign for false predictions in the immediate 
20 nt downstream and then becomes a sign for true predictions 
for further downstream sequences. In Hu et al. (2005), it was 
found that -100/-41 and +41/+ 100 regions of poly(A) motifs 
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are C- and G-rich regions. Our results suggest that looking at CG 
together rather than individually may capture more informative 
patterns. 

TA or AT is one of the characteristics for false poly(A) 
motifs among all 12 variants. No matter if it appears frequently 
in downstream or upstream nt sequences of the candidate 
poly(A) motif, TA or AT suggests that the candidate is a false 
one. Between them, TA is more informative than AT io specify 
false motifs. On the one hand, our findings coincide with previ- 
ous studies that TA and AT are important features around 
the poly(A) motifs and TA is more frequent than AT [e.g. the 
TAT AT A oligonucleotide is more over-represented than the 
AT AT AT oligonucleotide (Hu et al., 2005; van Helden et al., 
2000)]. On the other hand, our findings reveal that TA and 
AT appear much more often in sequences around the false 
motifs than around the true motifs. In van Helden et al. 
(2000); Hu et al. (2005), it was found that TAT AT A 
and AT AT AT are the most over-represented oligonucleotides 
downstream of the poly(A) motifs. However, by analyzing the 
negative motifs, our results imply that although TA and AT 
appear often in positive motifs, they appear even more often in 
negative ones. 

TG can determine false motifs at alternate positions upstream 
or downstream of the candidate motif of A AT AG A (Fig. 1(1)). 
This is not the case for the two most frequent motifs, A AT AAA 
and ATT AAA, which partially supports our hypothesis that the 
intrinsic characteristics of the frequent motifs and rare motifs are 
different. Thus, a good poly(A) motif predictor should have dif- 
ferent models for different motif variants. 

Figure 2 shows the importance scores for different positions 
with k from 1 to 5. Again, the 6nt positions for the candidate 
motifs offer no information to the prediction. However, because 
the £-mers overlapping with the motif regions contain subse- 
quences of the motifs, the motif regions do not have absolutely 
'zero' importance. Again, we list key observations here: 

The longer the subsequences are (bigger k), the smoother the 
importance curves are. This is expected because considering 
more nt at the same time will average the effects caused by in- 
dividual positions. 

In almost all the variants, the 50 nt downstream of the candi- 
date motifs are informative. Specifically, motif variants 
A AT A A A, CAT AAA and A AT AT A have important information 
at the positions around the 25th nt downstream of the candidate 
motifs (Fig. 2a, j and k). This coincides with the fact that the 
mRNA cleavage site is 15-25nt downstream of poly(A) motifs 
(van Helden et al, 2000; Retelska et al, 2006). 

5 CONCLUSION AND FUTURE WORKS 

In this article, we proposed a novel method to extract fea- 
tures from upstream and downstream regions of candidate 
poly(A) motifs in human DNA sequences. Our proposed spectral 
latent feature-based method achieves state-of-the-art results. 
The proposed method systematically explores the informa- 
tion encoded in nucleotide sequences by learning sequence 
dynamics and matching latent distributions on each position, 
and it can be easily extended to visualize the importance of 
subsequences and positions, thus providing a general method 
for sequence-based classification problems in bioinformatics. 



Our method can be directly applied to other sequence classi- 
fication problems and achieve state-of-the-art results, such 
as the transcription start-site prediction and splice- site 
prediction (Supplementary Material SI and Supplementary 
Table SI). 

Our method, currently, requires a fixed length of upstream and 
downstream sequences for the training and testing data. Such 
prior knowledge has to be given as input. We are trying to gen- 
eralize and extend our method to take varying lengths of se- 
quences for different samples. Furthermore, there may be 
longer-range dependency between the latent variables, and 
HMMs of higher orders may be needed for the feature extraction 
purpose, for which junction tree-type algorithms can be applied 
(Parikh et al, 2012). Similar to the WD kernel with shifts 
(Ratsch et al, 2005), our method can also be straightforwardly 
extended to take shifted matches into account. 
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